Abstract

We present a self-training approach to unsupervised dependency parsing that reuses existing supervised and unsupervised parsing algorithms. Our approach, called ‘iterated reranking’ (IR), starts with dependency trees generated by an unsupervised parser, and iteratively improves these trees using the richer probability models used in supervised parsing that are in turn trained on these trees. Our system achieves 1.8% higher accuracy than the state-of-the-art parser of Spitkovsky et al. (2013) on the WSJ corpus.

1 Introduction 

Unsupervised dependency parsing and its supervised counterpart have many characteristics in common: they take as input raw sentences, produce dependency structures as output, and often use the same evaluation metric (DDA, or UAS, the percentage of tokens for which the system predicts the correct head). Unsurprisingly, there has been much more research on supervised parsing - producing a wealth of models, datasets and training techniques - than on unsupervised parsing, which is more difficult, much less accurate and generally uses very simple probability models. Surprisingly, however, there have been no reported attempts to reuse supervised approaches to tackle the unsupervised parsing problem (an idea briefly mentioned in Spitkovsky et al. (2010b)).

There are, nevertheless, two aspects of supervised parsers that we would like to exploit in an unsupervised setting. First, we can increase the model expressiveness in order to capture more linguistic regularities. Many recent supervised parsers use third-order (or higher order) features (Koo and Collins, 2010; Martins et al., 2013; Le and Zuidema, 2014) to reach state-of-the-art (SOTA) performance. In contrast, existing models for unsupervised parsing limit themselves to using simple features (e.g., conditioning on heads and valency variables) in order to reduce the computational cost, to identify consistent patterns in data (Naseem, 2014, page 23), and to avoid overfitting (Blunsom and Cohn, 2010). Although this makes learning easier and more efficient, the disadvantage is that many useful linguistic regularities are missed: an upper bound on the performance of such simple models - estimated by using annotated data - is 76.3% on the WSJ corpus (Spitkovsky et al., 2013), compared to over 93% actual performance of the SOTA supervised parsers.

Second, we would like to make use of information available from lexical semantics, as in Bansal et al. (2014), Le and Zuidema (2014), and Chen and Manning (2014). Lexical semantics is a source for handling rare words and syntactic ambiguities. For instance, if a parser can identify that “he” is a dependent of “walks” in the sentence “He walks”, then, even if “she” and “runs” do not appear in the training data, the parser may still be able to recognize that “she” should be a dependent of “runs” in the sentence “She runs”. Similarly, a parser can make use of the fact that “sauce” and “John” have very different meanings to decide that they have different heads in the two phrases “ate spaghetti with sauce” and “ate spaghetti with John”.

However, applying existing supervised parsing techniques to the task of unsupervised parsing is, unfortunately, not trivial. The reason is that those parsers are designed to be trained on manually annotated data. If we use existing unsupervised training methods (like EM), learning could easily be misled by the large amount of ambiguity naturally embedded in unannotated training data. Moreover, the computational cost could rapidly increase if the training algorithm is not designed properly. To overcome these difficulties we propose a framework, iterated reranking (IR), where existing supervised parsers are trained without the need for manually annotated data, starting with dependency trees provided by an existing unsupervised parser as initialiser. Using this framework, we can employ the work of Le and Zuidema (2014) to build a new system that outperforms the SOTA unsupervised parser of Spitkovsky et al. (2013) on the WSJ corpus.

The contribution of this paper is twofold. First, we show the benefit of using lexical semantics for the unsupervised parsing task. Second, our work is a bridge connecting two research areas: unsupervised parsing and its supervised counterpart. Before going to the next section, in order to avoid confusion introduced by names, it is worth noting that we use existing supervised parsers in untrained form; they are then trained on automatically annotated treebanks.

2 Related Work 

2.1 Unsupervised Dependency Parsing 

The first breakthrough was made by Klein and Manning (2004) with their dependency model with valence (DMV), the first model to outperform the right-branching baseline on the DDA metric: 43.2% vs 33.6% on sentences up to length 10 in the WSJ corpus. Nine years later, Spitkovsky et al. (2013) achieved much higher DDAs: 72.0% on sentences up to length 10, and 64.4% on all sentences in section 23. During this period, many approaches were proposed to take up the challenge.

Naseem and Barzilay (2011), Tu and Honavar (2012), Spitkovsky et al. (2012), Spitkovsky et al. (2013), and Marecek and Straka (2013) employ extensions of the DMV but with different learning strategies. Naseem and Barzilay (2011) use semantic cues, which are event annotations from an out-of-domain annotated corpus, in their model during training. Relying on the fact that natural language grammars must be unambiguous in the sense that a sentence should have very few correct parses, Tu and Honavar (2012) incorporate unambiguity regularisation to posterior probabilities. Spitkovsky et al. (2012) bootstrap the learning by slicing up all input sentences at punctuation. Spitkovsky et al. (2013) propose a complete deterministic learning framework for breaking out of local optima using count transforms and model recombination. Marecek and Straka (2013) make use of a large raw text corpus (e.g., Wikipedia) to estimate stop probabilities, using the reducibility principle.

Differing from those works, Bisk and Hockenmaier (2012) rely on Combinatory Categorial Grammars with a small number of hand-crafted general linguistic principles, whereas Blunsom and Cohn (2010) use Tree Substitution Grammars with a hierarchical non-parametric Pitman-Yor process prior biasing the learning to a small grammar.

2.2 Reranking 

Our work relies on reranking, which is a technique widely used in (semi-)supervised parsing. Reranking requires two components: a k-best parser and a reranker. Given a sentence, the parser generates a list of k best candidates; the reranker then rescores those candidates and picks the one that has the highest score. Reranking was first successfully applied to supervised constituent parsing (Collins, 2000; Charniak and Johnson, 2005). It was then employed in the supervised dependency parsing approaches of Sangati et al. (2009), Hayashi et al. (2013), and Le and Zuidema (2014).

Closest to our work is the series of works on semi-supervised constituent parsing by McClosky and colleagues, e.g. McClosky et al. (2006), using self-training. They use a k-best generative parser and a discriminative reranker to parse unannotated sentences, then add the resulting parses to the training treebank and re-train the reranker. Different from their work, our work is for unsupervised dependency parsing, without manually annotated data, and uses iterated reranking instead of single reranking. In addition, both components, the k-best parser and the reranker, are re-trained after each iteration.



3 The IR Framework 

Existing training methods for the unsupervised dependency task, such as Blunsom and Cohn (2010), Gillenwater et al. (2011), and Tu and Honavar (2012), are hypothesis-oriented search with the EM algorithm or its variants: training is to move from a point which represents a model hypothesis to another point. This approach is feasible for optimising models using simple features since existing dynamic programming algorithms can compute expectations, which are sums over all possible parses, or find the best parse in the whole parse space with low complexity. However, the complexity increases rapidly if rich, complex features are used. One way to reduce the computational cost is to use approximation methods like sampling as in Blunsom and Cohn (2010).

3.1 Treebank-oriented Greedy Search 

Believing that the difficulty of using EM comes from the fact that treebanks are ‘hidden’, leading to the need to compute a sum (or max) over all possible treebanks, we propose a greedy local search scheme based on another training philosophy: treebank-oriented search. The key idea is to explicitly search for concrete treebanks which are used to train parsing models. This scheme thus allows supervised parsers to be trained in an unsupervised parsing setting since there is an (automatically annotated) treebank at any time.

Given $S$, a set of raw sentences, the search space consists of all possible treebanks $\mathcal{D} = \{d(s) \mid s \in S\}$, where $d(s)$ is a dependency tree of sentence $s$. The target of search is the optimal treebank $\mathcal{D}^*$ that is as good as human annotations. Greedy search with this philosophy is as follows: starting at an initial point $\mathcal{D}_1$, we pick a point $\mathcal{D}_2$ among its neighbours $N(\mathcal{D}_1)$ such that

$$\mathcal{D}_2 = \arg\max_{\mathcal{D} \in N(\mathcal{D}_1)} f_{\mathcal{D}_1}(\mathcal{D}) \qquad (1)$$

where $f_{\mathcal{D}_1}$ is an objective function measuring the goodness of $\mathcal{D}$ (which may or may not be conditioned on $\mathcal{D}_1$). We then continue this search until some stop criterion is satisfied. The crucial factor here is to define $N(\mathcal{D}_i)$ and $f_{\mathcal{D}_i}$. Below are two special cases of this scheme.


Semi-supervised parsing using reranking (McClosky et al., 2006). This reranking is indeed one-step greedy local search. In this scenario, $N(\mathcal{D}_1)$ is the Cartesian product of k-best lists generated by a k-best parser, and $f_{\mathcal{D}_1}(\mathcal{D})$ is a reranker.

Unsupervised parsing with hard-EM (Spitkovsky et al., 2010b). In hard-EM, the target is to maximise the following objective function with respect to a parameter set $\Theta$

$$L(S \mid \Theta) = \sum_{s \in S} \max_{d \in Dep(s)} \log P_{\Theta}(d) \qquad (2)$$

where $Dep(s)$ is the set of all possible dependency structures of $s$. The two EM steps are thus

• Step 1: $\mathcal{D}_{i+1} = \arg\max_{\mathcal{D}} P_{\Theta_i}(\mathcal{D})$

• Step 2: $\Theta_{i+1} = \arg\max_{\Theta} P_{\Theta}(\mathcal{D}_{i+1})$

In this case, $N(\mathcal{D}_i)$ is the whole treebank space and $f_{\mathcal{D}_i}(\mathcal{D}) = P_{\arg\max_{\Theta} P_{\Theta}(\mathcal{D}_i)}(\mathcal{D})$.

3.2 Iterated Reranking 

We instantiate the greedy search scheme by iterated reranking, which requires two components: a k-best parser $P$, and a reranker $R$. Firstly, $\mathcal{D}_1$ is used to train these two components, resulting in $P_1$ and $R_1$. The parser $P_1$ then generates a set of lists of k candidates $k\mathcal{D}_1$ (whose Cartesian product results in $N(\mathcal{D}_1)$) for the set of training sentences $S$. The best candidates, according to reranker $R_1$, are collected to form $\mathcal{D}_2$ for the next iteration. This process is halted when a pre-defined stop criterion is met.^1

We could, as in the work of Spitkovsky et al. (2010b) and many bootstrapping approaches, employ only the parser $P$. Reranking, however, brings us two benefits. First, it allows us to employ very expressive models like the ∞-order generative model proposed by Le and Zuidema (2014). Second, it embodies a similar idea to co-training (Blum and Mitchell, 1998): $P$ and $R$ play roles as two views of the data.

^1 It is worth noting that, although $N(\mathcal{D}_i)$ has the size $O(k^n)$ where $n$ is the number of sentences, reranking only needs to process $O(k \times n)$ parses if these sentences are assumed to be independent.
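The iterated reranking loop described in this section can be sketched as follows. This is a minimal illustration rather than the system's actual code: the function name, the callable interfaces `train_parser` and `train_reranker`, the tuple encoding of trees, and the fixed-point stop criterion are all assumptions made for the example. Because sentences are treated as independent, each iteration scores only k candidates per sentence instead of enumerating the Cartesian product.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Tree = Tuple[int, ...]       # head index of each token, 0 = ROOT (assumed encoding)
Treebank = Dict[str, Tree]   # sentence -> its current dependency tree


def iterated_reranking(
    sentences: Sequence[str],
    initial_treebank: Treebank,
    train_parser: Callable[[Treebank], Callable[[str, int], List[Tree]]],
    train_reranker: Callable[[Treebank], Callable[[str, Tree], float]],
    k: int = 10,
    max_iterations: int = 100,
) -> Treebank:
    """Greedy treebank-oriented search with a k-best parser and a reranker."""
    treebank = dict(initial_treebank)
    for _ in range(max_iterations):
        kbest = train_parser(treebank)    # parser trained on the current treebank
        score = train_reranker(treebank)  # reranker trained on the current treebank
        new_treebank: Treebank = {}
        for s in sentences:
            candidates = kbest(s, k)      # k-best list for this sentence
            # Sentences are independent, so the best treebank in the
            # neighbourhood is the per-sentence best candidate.
            new_treebank[s] = max(candidates, key=lambda d: score(s, d))
        if new_treebank == treebank:      # hypothetical stop criterion: fixed point
            break
        treebank = new_treebank
    return treebank
```

Any parser that returns a list of candidate trees and any reranker that returns a comparable score can be plugged in through the two training callables.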



3.3 Multi-phase Iterated Reranking 

Training in machine learning often uses starting big, which is to use all training data at the same time. However, Elman (1993) suggests that in some cases, learning should start by training simple models on small data and then gradually increase the model complexity and add more difficult data. This is called starting small.

In unsupervised dependency parsing, starting small is intuitive. For instance, given a set of long sentences, learning the fact that the head of a sentence is its main verb is difficult because a long sentence always contains many syntactic categories. It would be much easier if we start with only length-one sentences, e.g. “Look!”, since there is only one choice, which is usually a verb. This training scheme was successfully applied by Spitkovsky et al. (2010a) under the name Baby Step.

We adopt starting small to construct the multi-phase iterated reranking (MPIR) framework. In phase 0, a parser $M$ with a simple model is trained on a set of short sentences $S^{(0)}$ as in traditional approaches. This parser is used to parse a larger set of sentences $S^{(1)} \supseteq S^{(0)}$, resulting in $\mathcal{D}_1^{(1)}$. $\mathcal{D}_1^{(1)}$ is then used as the starting point for the iterated reranking in phase 1. We continue this process until phase $N$ finishes, with $S^{(i)} \supseteq S^{(i-1)}$ ($i = 1..N$). In general, we use the resulting reranker in the previous phase to generate the starting point for the iterated reranking in the current phase.
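The multi-phase scheme can be sketched as a thin driver around a single-phase iterated reranking routine. The sketch below is illustrative only: `multi_phase_ir`, `phase0_parse` and `run_ir` are hypothetical names, and it assumes that each phase returns both its final treebank and a function able to parse the sentences of the next phase.

```python
from typing import Callable, Dict, Sequence, Tuple

Tree = Tuple[int, ...]       # head index of each token, 0 = ROOT (assumed encoding)
Treebank = Dict[str, Tree]
ParseFn = Callable[[str], Tree]


def multi_phase_ir(
    sentence_sets: Sequence[Sequence[str]],  # growing sentence sets, one per phase
    phase0_parse: ParseFn,                   # simple parser obtained in phase 0
    run_ir: Callable[[Sequence[str], Treebank], Tuple[Treebank, ParseFn]],
) -> Treebank:
    """Run one iterated-reranking phase per sentence set, starting small."""
    parse = phase0_parse
    treebank: Treebank = {}
    for sents in sentence_sets:
        # The model from the previous phase provides the starting treebank.
        start = {s: parse(s) for s in sents}
        # One phase of iterated reranking; it also hands back a parse
        # function used to initialise the next, larger phase.
        treebank, parse = run_ir(sents, start)
    return treebank
```

The driver itself is agnostic to how `run_ir` is implemented, so the number of iterations per phase can be varied without touching this loop.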

4 Le and Zuidema (2014)’s Reranker 

Le and Zuidema (2014)’s reranker is an exception among supervised parsers because it employs an extremely expressive model whose features are ∞-order.^2 To overcome the problem of sparsity, they introduced the inside-outside recursive neural network (IORNN) architecture that can estimate tree-generating models including those proposed by Eisner (1996) and Collins (2003a).

^2 In fact, the order is finite but unbounded.

Figure 1: Inside-Outside Recursive Neural Network (IORNN). Black/white rectangles correspond to inner/outer representations.

4.1 The ∞-order Generative Model

Le and Zuidema (2014)’s reranker employs the generative model proposed by Eisner (1996). Intuitively, this model is top-down: starting with ROOT, we generate its left dependents and its right dependents. We then generate dependents for each of ROOT’s dependents. The generative process recursively continues until there is no dependent to generate. Formally, this model is described by the following formula

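$$P(d(H)) = \prod_{l=1}^{L} P\big(H^L_l \mid C(H^L_l)\big)\, P\big(d(H^L_l)\big) \times \prod_{r=1}^{R} P\big(H^R_r \mid C(H^R_r)\big)\, P\big(d(H^R_r)\big) \qquad (3)$$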

where $H$ is the current head, $d(N)$ is the fragment of the dependency parse rooted at $N$, and $C(N)$ is the context to generate $N$. $H^L$ and $H^R$ are respectively $H$’s left dependents and right dependents, plus EOC (End-Of-Children), a special token to inform that there are no more dependents to generate. Thus, $P(d(ROOT))$ is the probability of generating the entire dependency structure $d$.
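The recursion in Equation 3 can be made concrete with a short sketch that computes the log-probability of a tree given some conditional model. The code is an illustration under simplifying assumptions: nodes are identified by unique strings, the context passed to the hypothetical `cond_prob` callable consists only of the head, the direction and the previously generated siblings on that side, and the names `tree_log_prob`, `Children` and `EOC` are introduced here for the example.

```python
import math
from typing import Callable, Dict, List, Tuple

EOC = "<EOC>"  # End-Of-Children token closing each dependent sequence

# children[head] = (left dependents, right dependents), in generation order
Children = Dict[str, Tuple[List[str], List[str]]]
CondProb = Callable[[str, str, str, Tuple[str, ...]], float]


def tree_log_prob(head: str, children: Children, cond_prob: CondProb) -> float:
    """Log-probability of the tree fragment rooted at `head`, top-down."""
    left, right = children.get(head, ([], []))
    logp = 0.0
    for side, deps in (("L", left), ("R", right)):
        generated: List[str] = []
        for dep in deps + [EOC]:
            # P(dependent | context); the context here is a simplification.
            logp += math.log(cond_prob(dep, head, side, tuple(generated)))
            if dep != EOC:
                # Recurse: probability of the fragment rooted at the dependent.
                logp += tree_log_prob(dep, children, cond_prob)
                generated.append(dep)
    return logp
```

For the tree ROOT → walks → he, the product contains eight factors: three for ROOT, three for "walks" and two for "he", counting one EOC per side of every node.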

Le and Zuidema’s ∞-order generative model is defined as Eisner’s model in which the context $C^{\infty}(D)$ to generate $D$ contains all of $D$’s generated siblings, its ancestors and their siblings. Because of the very large fragments that contexts are allowed to hold, traditional count-based methods are impractical (even if we use smart smoothing techniques). They thus introduced the IORNN architecture to estimate the model.

4.2 Estimation with the IORNN

An IORNN (Figure 1) is a recursive neural network whose topology is a tree. What makes this network different from traditional RNNs (Socher et al., 2010) is that each tree node $u$ carries two vectors: $i_u$ - the inner representation, which represents the content of the phrase covered by the node, and $o_u$ - the outer representation, which represents the context around that phrase. In addition, information in an IORNN is allowed to flow not only bottom-up as in RNNs, but also top-down. That makes IORNNs a natural tool for estimating top-down tree-generating models.

Applying the IORNN architecture to dependency parsing is straightforward, along the generative story of the ∞-order generative model. First of all, the “inside” part of this IORNN is simpler than what is depicted in Figure 1: the inner representation of a phrase is assumed to be the inner representation of its head. This approximation is plausible since the meaning of a phrase is often dominated by the meaning of its head. The inner representation at each node, in turn, is a function of a vector representation for the word (in our case, the word vectors are initially borrowed from Collobert et al. (2011)), the POS-tag and the capitalisation feature.

Without loss of generality and ignoring directions for simplicity, they assume that the model is generating dependent $u$ for node $h$ conditioning on context $C^{\infty}(u)$, which contains all of $u$’s ancestors (including $h$) and their siblings, and all of $u$’s previously generated sisters. Now there are two types of contexts: full contexts of heads (e.g., $h$) whose dependents are being generated, and contexts to generate nodes (e.g., $C^{\infty}(u)$). Contexts of the first type are clearly represented by outer representations. Contexts of the other type are represented by partial outer representations, denoted by $\bar{o}_u$. Because the context to generate a node can be constructed recursively by combining the full context of its head and its previously generated sisters, they can compute $\bar{o}_u$ as a function of $o_h$ and the inner representations of its previously generated sisters. On top of $\bar{o}_u$, they put a softmax layer to estimate the probability $P(x \mid C^{\infty}(u))$.
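A toy version of this last step, a partial outer representation followed by a softmax, is sketched below in plain Python. The text only states that the partial outer representation is a function of the head's outer representation and the sisters' inner representations; the specific form used here (averaging the sister vectors, a tanh non-linearity, and the weight names `w_head`, `w_sis`, `bias`, `w_out`) is an assumption made for illustration and should not be read as the architecture's actual equations.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def partial_outer(o_h: Vector, sister_inners: Sequence[Vector],
                  w_head: Matrix, w_sis: Matrix, bias: Vector) -> Vector:
    """Assumed composition: tanh(W_head o_h + W_sis mean(sisters) + b)."""
    if sister_inners:
        mean_sis = [sum(col) / len(sister_inners) for col in zip(*sister_inners)]
    else:
        mean_sis = [0.0] * len(w_sis[0])  # no sister generated yet
    from_head = matvec(w_head, o_h)
    from_sis = matvec(w_sis, mean_sis)
    return [math.tanh(a + b + c) for a, b, c in zip(from_head, from_sis, bias)]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def dependent_distribution(o_bar: Vector, w_out: Matrix) -> Vector:
    """Distribution over candidate dependents x given the context vector."""
    return softmax(matvec(w_out, o_bar))
```

Each row of `w_out` corresponds to one candidate dependent (including EOC), so the returned list has one probability per candidate.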

Training this IORNN is to minimise the cross entropy over all dependents. This objective function is indeed the negative log likelihood, $-\log P(\mathcal{D})$, of the training treebank $\mathcal{D}$.

4.3 The Reranker 

Le and Zuidema’s (generative) reranker is given by

$$d^* = \arg\max_{d \in kDep(s)} P(d)$$

where $P$ (Equation 3) is computed by the ∞-order generative model, which is estimated by an IORNN; and $kDep(s)$ is a k-best list.

5 Complete System 

Our system is based on the multi-phase IR. In general, any third-party parser for unsupervised dependency parsing can be used in phase 0, and any third-party parser that can generate k-best lists can be used in the other phases. In our experiments, for phase 0, we choose the parser using an extension of the DMV model with stop-probability estimates computed on a large corpus proposed by Marecek and Straka (2013). This system has a moderate performance^3 on the WSJ corpus: 57.1% vs the SOTA 64.4% DDA of Spitkovsky et al. (2013). For the other phases, we use the MSTParser^4 (with the second-order feature mode) (McDonald and Pereira, 2006).

Our system uses Le and Zuidema (2014)’s reranker (Section 4.3). It is worth noting that, in this case, each phase with iterated reranking could be seen as an approximation of hard-EM (see Equation 2) where the first step is replaced by

$$\mathcal{D}_{i+1} = \arg\max_{\mathcal{D} \in N(\mathcal{D}_i)} P_{\Theta_i}(\mathcal{D}) \qquad (4)$$

In other words, instead of searching over the whole treebank space, the search is limited to a neighbour set $N(\mathcal{D}_i)$ generated by the k-best parser $P_i$.

5.1 Tuning Parser P 

Parser $P_i$ trained on $\mathcal{D}_i$ defines the neighbour set $N(\mathcal{D}_i)$, which is the Cartesian product of the k-best lists in $k\mathcal{D}_i$. The position and shape of $N(\mathcal{D}_i)$ are thus determined by two factors: how well $P_i$ can fit $\mathcal{D}_i$, and $k$. Intuitively, the lower the fitness is, the further $N(\mathcal{D}_i)$ moves away from $\mathcal{D}_i$; and the larger $k$ is, the larger $N(\mathcal{D}_i)$ is. Moreover, the diversity of $N(\mathcal{D}_i)$ is inversely proportional to the fitness. When the fitness decreases, patterns existing in the training treebank become less certain to the parser; patterns that do not exist in the training treebank thus have more chances to appear in the k-best candidates. This leads to high diversity of $N(\mathcal{D}_i)$. We blindly set $k = 10$ in all of our experiments.

^3 Marecek and Straka (2013) did not report any experimental result on the WSJ corpus. We use their source code at http://ufal.mff.cuni.cz/udp with the setting presented in Section 6.1. Because the parser does not provide the option to parse unseen sentences, we merge the training sentences (up to length 15) with all the test sentences to evaluate its performance. Note that this result is close to the DDA (55.4%) that the authors reported on the CoNLL 2007 English dataset, which is a portion of the WSJ corpus.

^4 http://sourceforge.net/projects/mstparser/

With the MSTParser, there are two hyper-parameters: $iters_{MST}$, the number of epochs, and $training\text{-}k_{MST}$, the k-best parse set size used to create constraints during training. $training\text{-}k_{MST}$ is always 1 because constraints derived from k-best parses are of little use when the training parses themselves are largely incorrect.

Because $iters_{MST}$ controls the fitness of the parser to the training treebank $\mathcal{D}_i$, it, as pointed out above, determines the distance from $N(\mathcal{D}_i)$ to $\mathcal{D}_i$ and the diversity of the former. Therefore, if we want to encourage the local search to explore more distant areas, we should set $iters_{MST}$ low. In our experiments, we test two strategies: (i) MaxEnc, $iters_{MST} = 1$, maximal encouragement, and (ii) MinEnc, $iters_{MST} = 10$, minimal encouragement.

5.2 Tuning Reranker R 

Tuning the reranker $R$ is to set values for $dim_{IORNN}$, the dimensions of inner and outer representations, and $iters_{IORNN}$, the number of epochs to train the IORNN. Because the ∞-order model is very expressive and feed-forward neural networks are universal approximators (Cybenko, 1989), the reranker is capable of perfectly remembering all training parses. In order to avoid this, we set $dim_{IORNN} = 50$, and set $iters_{IORNN} = 5$ for very early stopping.

5.3 Tuning multi-phase IR 

Because Marecek and Straka (2013)’s parser does not distinguish training data from test data, we postulate $S^{(0)} = S^{(1)}$. Our system has $N$ phases such that $S^{(0)}$ and $S^{(1)}$ contain all sentences up to length $l_1 = 15$, $S^{(i)}$ ($i = 2..N$) contains all sentences up to length $l_i = l_{i-1} + 1$, and $S^{(N)}$ contains all sentences up to length 25. Phase 1 halts after 100 iterations whereas all the following phases run with one iteration. Note that we force the local search in phase 1 to run intensively because we hypothesise that most of the important patterns for dependency parsing can be found within short sentences.
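The resulting phase schedule can be written out explicitly. The sketch below reads the length recurrence as "each phase admits sentences one word longer than the previous phase", which agrees with the later remark that phase 3 covers sentences up to length 17; the function name and parameter names are hypothetical.

```python
from typing import List, Tuple


def phase_schedule(first_max_len: int = 15, last_max_len: int = 25,
                   phase1_iters: int = 100, later_iters: int = 1
                   ) -> List[Tuple[int, int, int]]:
    """Return (phase, max sentence length, IR iterations) for phases 1..N."""
    schedule = []
    lengths = range(first_max_len, last_max_len + 1)
    for phase, max_len in enumerate(lengths, start=1):
        # Phase 1 searches intensively; later phases run a single iteration.
        iters = phase1_iters if phase == 1 else later_iters
        schedule.append((phase, max_len, iters))
    return schedule


if __name__ == "__main__":
    for phase, max_len, iters in phase_schedule():
        print(f"phase {phase}: sentences up to length {max_len}, {iters} iteration(s)")
```

Under this reading, the maximum lengths 15 through 25 give eleven phases in total.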


6 Experiments 

6.1 Setting 

We use the Penn Treebank WSJ corpus: sections 02-21 for training, and section 23 for testing. We then apply the standard pre-processing^5 for the unsupervised dependency parsing task (Klein and Manning, 2004): we strip off all empty sub-trees, punctuation, and terminals (tagged # and $) not pronounced where they appear; we then convert the remaining trees to dependencies using Collins’s head rules (Collins, 2003b). Both word forms and gold POS tags are used. The directed dependency accuracy (DDA) metric is used for evaluation.
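For reference, DDA is the percentage of tokens whose predicted head matches the gold head. A minimal sketch of the computation follows, assuming each tree is encoded as a list of head indices with 0 standing for ROOT; this encoding and the function name are choices made for the example.

```python
from typing import Sequence


def directed_dependency_accuracy(gold: Sequence[Sequence[int]],
                                 predicted: Sequence[Sequence[int]]) -> float:
    """Percentage of tokens whose predicted head equals the gold head."""
    correct = 0
    total = 0
    for gold_heads, pred_heads in zip(gold, predicted):
        if len(gold_heads) != len(pred_heads):
            raise ValueError("gold and predicted trees must have the same length")
        correct += sum(1 for g, p in zip(gold_heads, pred_heads) if g == p)
        total += len(gold_heads)
    return 100.0 * correct / total if total else 0.0
```

Scores are micro-averaged over tokens, so long sentences weigh more than short ones.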

The vocabulary is taken as a list of words occurring more than two times in the training data. All other words are labelled ‘UNKNOWN’ and every digit is replaced by ‘0’. We initialise the IORNN with the 50-dim word embeddings from Collobert et al. (2011)^6, and train it with the learning rate 0.1,

6.2 Results 

We compare our system against recent systems (Table 1 and Section 2.1). Our system with the two encouragement levels, MinEnc and MaxEnc, achieves the highest reported DDAs on section 23: 1.8% and 1.2% higher than Spitkovsky et al. (2013) on all sentences and up to length 10, respectively. Our improvements over the system’s initialiser (Marecek and Straka, 2013) are 9.1% and 4.4%.

6.3 Analysis 

In this section, we analyse our system along two aspects. First, we examine three factors which determine the performance of the whole system: encouragement level, lexical semantics, and starting point. We then search for what IR (with the MaxEnc option) contributes to the overall performance by comparing the quality of the treebank resulting at the end of phase 1 against the quality of the treebank given by its initialiser, i.e. Marecek and Straka (2013).

The effect of encouragement level 

Figure 2 shows the differences in DDA between using MaxEnc and MinEnc in each phase: we compute $DDA_{MaxEnc} - DDA_{MinEnc}$ of each phase on its training set (e.g., phase 3 with $S^{(3)}$ containing all training sentences up to length 17). MaxEnc outperforms MinEnc within phases 1, 2, 3, and 4. However, from phase 5, the latter surpasses the former. This suggests that exploring areas far away from the current point with long sentences is risky. The reason is that long sentences contain more ambiguities than short ones; thus rich diversity, high difference from the current point, but small size (i.e., small k) could easily lead the learning to a wrong path.

System                          DDA (@10)
Bisk and Hockenmaier (2012)     53.3 (71.5)
Blunsom and Cohn (2010)         55.7 (67.7)
Tu and Honavar (2012)           57.0 (71.4)
Marecek and Straka (2013)       57.1 (68.8)
Naseem and Barzilay (2011)      59.4 (70.2)
Spitkovsky et al. (2012)        61.2 (71.4)
Spitkovsky et al. (2013)        64.4 (72.0)
Our system (MinEnc)             66.2 (72.7)
Our system (MaxEnc)             65.8 (73.2)

Table 1: Performance on section 23 of the WSJ corpus (all sentences and up to length 10) for recent systems and our system. MinEnc and MaxEnc denote $iters_{MST} = 10$ and $iters_{MST} = 1$ respectively.

Figure 2: $DDA_{MaxEnc} - DDA_{MinEnc}$ of all phases on their training sets (e.g., phase 3 with $S^{(3)}$ containing all training sentences up to length 17). The horizontal axis gives the phase together with its maximum sentence length.

^5 http://www.cs.famaf.unc.edu.ar/~francolq/en/proyectos/dmvccm

^6 http://ml.nec-labs.com/senna/. These word embeddings were unsupervisedly learnt from Wikipedia.

The performance of the system with the two encouragement levels on section 23 (Table 1) also suggests the same. The MaxEnc strategy helps the system achieve the highest accuracy on short sentences (up to length 10). However, it is less helpful than MinEnc when performing on long sentences.



Figure 3: DDA of phase 1 (MaxEnc), with and without the word embeddings (denoted by w/ sem and wo/ sem, respectively), on training sentences up to length 15 (i.e. $S^{(1)}$).

Figure 4: DDA of phase 1 (MaxEnc) before and after training with three different starting points provided by three parsers used in phase 0: MS (Marecek and Straka, 2013), GGGPT (Gillenwater et al., 2011), and Harmonic (Klein and Manning, 2004).

The role of lexical semantics 

We examine the role of the lexical semantics, which is given by the word embeddings. Figure 3 shows DDAs on training sentences up to length 15 (i.e. $S^{(1)}$) of phase 1 (MaxEnc) with and without the word embeddings. With the word embeddings, phase 1 achieves 71.11%. When the word embeddings are not given, i.e. the IORNN uses randomly generated word vectors, the accuracy drops by 4.2%. This shows that lexical semantics plays a decisive role in the performance of the system.

However, it is worth noting that, even without that knowledge (i.e., with the ∞-order generative model alone), the DDA of phase 1 is 2% higher than before being trained (66.89% vs 64.9%). This suggests that phase 1 is capable of discovering some useful dependency patterns that are invisible to the parser in phase 0. This, we conjecture, is thanks to high-order features captured by the IORNN.

The importance of the starting point 

The starting point is claimed to be important in local search. We examine this by using three different parsers in phase 0: (i) MS (Marecek and Straka, 2013), the parser used in the previous experiments, (ii) GGGPT (Gillenwater et al., 2011)^7, employing an extension of the DMV model and the posterior regularization framework for training, and (iii) Harmonic, the harmonic initializer proposed by Klein and Manning (2004).

Figure 5: Precision (top) and recall (bottom) over binned HEAD distance of iterated reranking (IR) and its initializer (MS) on the training sentences in phase 1 (≤ 15 words).

Figure 4 shows DDAs of phase 1 (MaxEnc) on training sentences up to length 15 with three starting points given by those parsers. The starting point is clearly very important to the performance of the iterated reranking: the better the starting point is, the higher the performance of phase 1. However, a remarkable point here is that the iterated reranking of phase 1 always finds more useful patterns for parsing, whatever the starting point is in this experiment. This is most likely due to the high-order features and lexical semantics, which are not exploited in those parsers.

The contribution of Iterated Reranking 

We compare the quality of the treebank resulting at the end of phase 1 against the quality of the treebank given by the initialiser, Marecek and Straka (2013). Figure 5 shows precision (top) and recall (bottom) over binned HEAD distance. IR helps to improve the precision on all distance bins, especially on the bins corresponding to long distances (> 3). The recall is also improved, except on the bin corresponding to > 7 (but the F1-score on this bin is increased). We attribute this improvement to the ∞-order model, which uses very large fragments as contexts and is thus able to capture long dependencies.

^7 code.google.com/p/pr-toolkit
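Precision and recall over binned head distance can be computed as sketched below. Precision for a bin is taken over the arcs the system predicts at that distance, recall over the gold arcs at that distance. The bin boundaries ("1", "2", "3-6", ">=7", plus a separate "root" bin), the head-index encoding with 0 for ROOT, and the function names are assumptions for illustration; the exact bins used in Figure 5 are not recoverable from the text.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def distance_bin(dependent: int, head: int) -> str:
    """Hypothetical binning of the linear distance between a token and its head."""
    if head == 0:
        return "root"
    d = abs(dependent - head)
    if d <= 2:
        return str(d)
    return "3-6" if d <= 6 else ">=7"


def precision_recall_by_distance(
    gold: Sequence[Sequence[int]], predicted: Sequence[Sequence[int]]
) -> Dict[str, Tuple[float, float]]:
    """Map each distance bin to (precision, recall)."""
    gold_n: Counter = Counter()
    pred_n: Counter = Counter()
    correct_n: Counter = Counter()
    for gold_heads, pred_heads in zip(gold, predicted):
        # Tokens are numbered from 1 so that 0 can denote ROOT.
        for dep, (g, p) in enumerate(zip(gold_heads, pred_heads), start=1):
            gold_n[distance_bin(dep, g)] += 1
            pred_n[distance_bin(dep, p)] += 1
            if g == p:
                correct_n[distance_bin(dep, g)] += 1
    bins = set(gold_n) | set(pred_n)
    return {
        b: (correct_n[b] / pred_n[b] if pred_n[b] else 0.0,
            correct_n[b] / gold_n[b] if gold_n[b] else 0.0)
        for b in bins
    }
```

A bin with no predicted (or no gold) arcs is reported with precision (or recall) 0.0 here; another convention would be to leave it undefined.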

Figure 6 shows the correct-head accuracies over POS-tags. IR helps to improve the accuracies over almost all POS-tags, particularly nouns (e.g. NN, NNP, NNS), verbs (e.g. VBD, VBZ, VBN, VBG) and adjectives (e.g. JJ, JJR). However, being affected by the initializer, IR performs poorly on conjunctions (CC) and modal auxiliaries (MD). For instance, in the treebank given by the initializer, almost all modal auxiliaries are dependents of their verbs instead of the other way around.

7 Discussion 

Our system is different from the other systems shown in Table 1 as it uses an extremely expressive model, the ∞-order generative model, in which conditioning contexts are very large fragments. Only the work of Blunsom and Cohn (2010), whose resulting grammar rules can contain large tree fragments, shares this property. The difference is that their work needs a pre-defined prior, namely a hierarchical non-parametric Pitman-Yor process prior, to avoid large, rare fragments and for smoothing. The IORNN of our system, in contrast, does that automatically. It learns by itself how to deal with distant conditioning nodes, which are often less informative than close conditioning nodes for computing $P(x \mid C^{\infty}(u))$. In addition, smoothing comes for free: recursive neural nets are able to map ‘similar’ fragments onto close points (Socher et al., 2010), thus an unseen fragment tends to be mapped onto a point close to points corresponding to ‘similar’ seen fragments.

Another difference is that our system exploits lexical semantics via word embeddings, which were learnt unsupervisedly. By initialising the IORNN with these embeddings, the use of this knowledge turns out to be easy and transparent. Spitkovsky et al. (2013) also exploit lexical semantics but in a limited way, using a context-based polysemous unsupervised clustering method to tag words. Although their approach can distinguish polysemes (e.g., ‘cool’ in ‘to cool the selling panic’ and in ‘it is cool’), it is not able to make use of word meaning similarities (e.g., the meaning of ‘dog’ is closer to ‘animal’ than to ‘table’). Naseem and Barzilay (2011)’s system uses semantic cues from an out-of-domain annotated corpus, and thus is not fully unsupervised.

Figure 6: Correct-head accuracies over POS-tags (sorted in descending order by frequency) of iterated reranking (IR) and its initializer (MS) on the training sentences in phase 1 (≤ 15 words).

We have shown that IR with a generative reranker is an approximation of hard-EM (see Equation 4). Our system is thus related to the works of Spitkovsky et al. (2013) and Tu and Honavar (2012). However, what we have proposed is more than that: IR is a general framework in which we have more than one option for choosing the k-best parser and the reranker. For instance, we can make use of a generative k-best parser and a discriminative reranker that are used for supervised parsing. Our future work is to explore this.

The experimental results reveal that the starting point is very important to the iterated reranking with the ∞-order generative model. On the one hand, that is a disadvantage compared to the other systems, which use uninformed or harmonic initialisers. But on the other hand, that is an innovation as our approach is capable of making use of existing systems. The results shown in Figure 4 suggest that if phase 0 uses a better parser that relies on a less expressive model and/or less external knowledge than our model, such as the one proposed by Spitkovsky et al. (2013), we can expect even higher performance. The other systems, except Blunsom and Cohn (2010), however, might not benefit from using good existing parsers as initializers because their models are not significantly more expressive than others.

8 Conclusion 

We have proposed a new framework, iterated reranking (IR), which trains supervised parsers without the need for manually annotated data by using an unsupervised parser as an initialiser. Our system, employing Marecek and Straka (2013)’s unsupervised parser as the initialiser, the k-best MSTParser, and Le and Zuidema (2014)’s reranker, achieved a DDA 1.8% higher than the SOTA parser of Spitkovsky et al. (2013) on the WSJ corpus. Moreover, we also showed that unsupervised parsing benefits from lexical semantics through the use of word embeddings.

Our future work is to exploit other existing supervised parsers that fit our framework. Besides, taking into account the fast development of word embedding research (Mikolov et al., 2013; Pennington et al., 2014), we will try different word embeddings.
^8 In an experiment, we used the Marecek and Straka (2013)’s